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Abstract 

Background: When a large number of candidate variables are present, a dimension reduction procedure is usually 
conducted to reduce the variable space before the subsequent analysis is carried out. The goal of dimension 
reduction is to find a list of candidate genes with a more operable length ideally including all the relevant genes. 
Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power. Therefore, 
dimension reduction is often considered a necessary predecessor of the analysis because it can not only reduce 
the cost of handling numerous variables, but also has the potential to improve the performance of the 
downstream analysis algorithms. 

Results: We propose a TMLE-VIM dimension reduction procedure based on the variable importance measurement 
(VIM) in the frame work of targeted maximum likelihood estimation (TMLE). TMLE is an extension of maximum 
likelihood estimation targeting the parameter of interest. TMLE-VIM is a two-stage procedure. The first stage resorts 
to a machine learning algorithm, and the second step improves the first stage estimation with respect to the 
parameter of interest. 

Conclusions: We demonstrate with simulations and data analyses that our approach not only enjoys the 
prediction power of machine learning algorithms, but also accounts for the correlation structures among variables 
and therefore produces better variable rankings. When utilized in dimension reduction, TMLE-VIM can help to 
obtain the shortest possible list with the most truly associated variables. 



Background 

Gene expression microarray data are typically character- 
ized by large quantities of variables with unknown cor- 
relation structures [1,2]. This high dimensionality has 
presented us challenges in analyzing the data, especially 
when correlations among variables are complex. Includ- 
ing many variables in standard statistical analyses can 
easily cause problems such as singularity and overfitting, 
and sometimes is not even doable. To manage this pro- 
blem, the dimensionality of the data will often be 
reduced in the first step. There are multiple ways to 
achieve this goal. One is to select a subset of genes 
based on certain criteria such that this subset of genes 
is believed to best predict the outcome. This gene selec- 
tion strategy is typically based on some univariate mea- 
surement related to the outcome, such as t-test and 
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rank test [3,4]. Another strategy is to use a weighted 
combination of genes of lower dimension to represent 
the total variation of the data. Representative approaches 
are principle component analysis (PCA) [5] and partial 
least squares (SLR) [6-9]. Machine learning algorithms 
such as LASSO [10,11] and Random Forest [12] have 
embedded capacity to select variables while simulta- 
neously making predictions, and can be used to accom- 
modate high dimensional microarray data. 

As always, there is no one-size-fits-all solution to this 
problem, and one often needs to resort to a mix-and- 
match strategy. The univariate-measurement based gene 
selection is a very popular approach in the field. It is 
fast and scales easily to the dimension of the data. The 
output is usually stable and easy to understand, and ful- 
fills the objectives of the biologists to directly pursue 
interesting findings. However, it often relies on over- 
simplified models. For instance, the univariate analysis 
evaluates every gene in isolation of others, with the 
unrealistic assumption of independence among genes. 
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As a result, it carries a lot of noise and the selected 
genes are often highly correlated, which themselves cre- 
ate problems in subsequent analysis. Also, due to the 
practical limit of the size of the gene subset, real infor- 
mative genes with weaker signals will be left out. In 
contrast, PCA/PLS constructs a few gene components as 
linear combinations of all genes in a dataset. This 
"Super Gene" approach assumes that the majority of the 
variation in the dataset can be explained by a small 
number of underlying variables. One then uses these 
gene components to predict the outcome. These 
approaches can better handle the dependent structure of 
genes and their performances are quite acceptable [13]. 
But it is harder to interpret gene components biologi- 
cally, and to assess the effect of individual genes one 
needs to look at the weight coefficients of the linear 
combination. Machine learning algorithms are very 
attractive variable selection tools to deal with large quan- 
tities of genes. They are prediction algorithms with 
embedded abilities to select gene subsets. However, 
whether or not a gene is chosen by a learning algorithm 
may not be the best measurement of its importance. 
Machine learning algorithms are constructed to achieve 
an optimal prediction accuracy, which often overlooks 
the importance of each variable. Consequently, small 
changes in data or tuning parameters may result in big 
changes in variable rankings and the the selected gene 
subsets are instable. For example, Random Forest, a tree- 
based non-parametric method, has a variable importance 
measurement that greatly contributes to its popularity. 
This measurement is sensitive to the parameter choices 
of trees in the presence of high correlations among vari- 
ables, because different sets of variables can produce 
nearly unchanged prediction accuracy [14,15]. Another 
example is LASSO - one of the most popular regulariza- 
tion algorithms. Assuming a sparse signal, LASSO han- 
dles the high dimensionality problem by shrinking the 
coefficients of most variables towards zero [16]. A recent 
implementation of LASSO is in the GLMNET R package 
[17]. The package uses a coordinate descent algorithm 
and can finish an analysis of 20,000 variables within a few 
seconds. To us, its result is somewhat sensitive to the 
choice of the penalizing parameter A. Different As may 
result in gene subsets with little overlapping. In the mean 
time, variable importance measurements are not readily 
available in LASSO. One can simply rank genes by their 
coefficients, but this can be quite subtle. Although per- 
mutation tests may be used to derive p-values, how to 
perform the permutation is a tricky matter due to selec- 
tion of tuning parameters. For small p-values, it is still 
computationally infeasible. In this paper, inspired by con- 
cepts of counterfactual effects from the causal inference 
literature, we propose a targeted variable importance 
measurement [18,19] to rank genes and reduce the 



dimensionality of the dataset. Counterfactuals are usually 
defined in the context of treatment to disease. It is the 
outcome a patient would have had a treatment been 
assigned differently, with everything else held the same. 
Hence counterfactuals are "counter"-fact and apparently 
impossible to be observed. But it can be estimated statis- 
tically. Suppose that we have an outcome Y, a binary 
treatment A, and the confounding variables W of A, and 
we have worked out correctly an estimate £ of the condi- 
tional expectation of Y given A and W. A common way 
to estimate the counterfactual effect of A is to compute 
the difference between the E(Yi\Ai = 1, Wj) and the 
E(Yi\Ai = 0, Wj) for every observation and then average 
over all observations, referred to as the G-computation 
method [20]. Although counterfactuals may not be com- 
pletely relevant to gene microarray data, thinking about 
the data in this way is very helpful for us to assess the 
importance of a gene. Our VIM definition uses the con- 
cepts of counterfactuals and the estimation framework is 
built on the methodology of targeted maximum likeli- 
hood estimation (TMLE) [21]. By tailoring this recently 
developed technique specifically to gene expression data, 
we hope to introduce to the community an alternative 
strategy to carry out gene selection in addition to current 
methods. Our approach takes the advantage of prediction 
power of learning algorithms while targeting at the indi- 
vidual importance of each variable. Its mathematical 
property has been studied in [22], and we will focus on 
its application. In brief, our approach consists of two- 
stages. In the first stage, we predict the outcome given all 
genes. In the second stage, we improve the first stage by 
modeling the mechanism between an individual gene and 
its confounding variables. Both stages can be very flexible 
ranging from using univariate analysis to refined learning 
algorithms. When machine learning algorithms are used, 
we have the flexibility to determine how to make predic- 
tions without restricting ourselves to explicit models and 
distributions. In the meanwhile, as in the case of the uni- 
variate analysis, we return to a simple and well interpre- 
table measure of the importance of each gene. This 
importance measurement is derived in the presence of 
the confounding variables of a gene, and hence can help 
to exploit the redundant information among correlated 
genes. It is generally also more stable than the variable 
importance produced by machine learning algorithms. In 
addition, our approach provides a simple way for statisti- 
cal inference based on asymptotic theories, and is well 
suited for the exploratory analysis of microarray data. 

Methods 

Suppose the observed data are i.i.d. O, = {Y h A h Wi), 
where Y is a continuous outcome, A is the gene of inter- 
est, W is a set of confounding variables of A, and i = 1, 
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n indexes the observation. Let ^(a) represents the 
variable importance measurement (VIM) of A. One can 
define the VIM of A as the marginal effect of A on the 
outcome Y at value A = a relative to A = 0 adjusted for 
W, and then averaged over the distribution of W [18]: 

*(a) = E W [E{Y\A = a, W) - E(Y\A = 0, W)]. 

Consider the semiparametric regression model: 

E{Y\A, W) = pA+f{W), 

where J[W) is a function of W. With this parameteri- 
zation, we have ^(a) = fia. We can then view P as an 
index of the VIM of A. In the above model, the only 
assumption we make is the linearity of A. The definition 
of the VIM of A is closely related to the definition of 
the counterfactual effect in causal inference [23]. 
Although ft can not be directly interpreted as an causal 
effect without proper assumptions [19], it serves well as 
a surrogate of the magnitude of the causal relationship 
between the outcome and a gene. The motivation of 
this parameterization is that by selecting more causally 
related genes, the resulting prediction function will be 
better generalized to new experiments with the same 
causal relation between the outcome Y and A, but a dif- 
ferent joint distribution of W. If in a next experiment, 
the technology or the sampling population is somewhat 
different, but the causal mechanism is still the same, 
then a prediction function that uses the correlates of the 
true causal variables will perform poorly while a predic- 
tion function using the true causal variables will still 
perform nicely. This idea will be illustrated in our 
simulations. 

Our goal is to estimate /3. In [22], this estimation pro- 
blem was addressed in the framework of targeted maxi- 
mum likelihood estimation (TMLE). TMLE is an 
estimating equation and efficient estimation theory 
based methodology [24], and is particularly useful when 
it comes to semiparametric models. Estimators from the 
traditional method such as MLE perform well for para- 
metric models, however, they are generally biased rela- 
tive to their variances especially when the model space 
is large. This is because the MLE focuses on doing a 
good job on the estimation of the whole density rather 
than on the parameter itself. TMLE is designed to 
achieve an optimal trade-off between the bias and the 
variance of the estimator. It uses an MLE framework, 
but instead of estimating the overall density, TMLE tar- 
gets on the parameter of interest and produces estima- 
tors minimally affected by changes of the nuisance 
parameters in a model. In Additional File 1 we provide 
a brief overview of this methodology with a demo simu- 
lation example. The formal mathematical formulation of 
TMLE can be found in the original paper by van der 



Laan and Rubin [21]. The implementation of TMLE to 
estimate /3 is fairly simple and consists of two stages. 
First, we estimate E(Y \A, W) without any parametric 
restriction. We then regress the residual of Y and E(Y \ 
A = 0, W) onto /3A to conform with our semiparametric 
regression model. This will yield an initial estimator of 
P and fitted values of E(Y \A, W), denoted by and 
the q( 0 )- In the second stage, we update these initial 
estimates in a direction targeted at [}. This involves 
regressing the residuals of Y and the fitted qM on the 
clever covariate A - E(A\W). The E(A\ W) evaluates the 
confounding of A with W, and we name it the "gene 
confounding mechanism". It needs to be estimated if 
unknown. Let us denote the coefficient before the clever 
covariate as e. The updated TMLE estimate of /3 is 
fl(°) + £fl , where e„ is the estimated value of e. The var- 
iance estimate of /? can be computed from its efficient 
influence curve. Below is a step-by-step implementation 
of our algorithm, and we refer to it as the TMLE-VIM 
procedure. 

1. Obtain the initial estimator q^ 0 ) and Use 
your favorite algorithm here, for example, linear 
regression, LASSO, Random Forest, etc. 

2. Obtain the g„{W) estimate for the gene confound- 
ing mechanism E(A\ W). As in the case of Q„(0), a 
broad spectrum of algorithms can be used. In this 
paper, we use LASSO (in the GLMNET R package) 
for its optimal speed. 

3. Compute the "clever covariate": 

r{A, W)=A-g n {W). 

4. Fit regression y> = y _ q(°> ~ sr (A, W) 

5. Update the initial estimate y^ 0 ' with 

Pn^ = Pn^ + e n/ 

and update the initial fitted values with 

Q [ n ] = Q [ n ] + £nr{A, W). 



6. Compute the variance estimate a„ for ^gW accord- 
ing to its efficient influence curve: 
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where i indexes the i-th observation. 

7. Construct the test statistic: 

M) 

T (A) = 

T(A) follows the standard Gaussian distribution under 
the null hypothesis [} = 0 when the sample size n goes 
to infinity. 

The TMLE estimator is a consistent estimator of 
j3 when either the q(°' or the g„{W ) is consistent. 
When the is consistent, it is also asymptotically effi- 
cient. The derivations of the clever covariate, the effi- 
cient influence curve, the TMLE estimate and its 
mathematical properties can be found in [22] and [18]. 
Upon the construction of the test statistic, a p-value can 
be calculated for the adjusted marginal effect of A and 
used as an index of the variable importance. 

In the application to dimension reduction, for each 
variable in the dataset, we compute a TMLE-VIM p- 
value. We then reduce our variable space based on 
these p-values. There are two notions. First, in principle, 
a separate initial estimator q(°) should be fitted for 
every gene A by forcing A as a term in the algorithm 
used. This can become quite time consuming. To solve 
the problem, instead of estimating E(Y \A, W) for each 
A, we obtain a grand estimate G„(V) for E(Y \ V). Here 
V represents all variables in the dataset. Then for every 
A in V, we carry out the regression Y ~ fiA with the off- 
set G„(A = 0) to get /J„(0) and q(°1 Second, when 
obtaining the g n {W), we want to be attentive to how clo- 
sely W is correlated with A. The independence between 
W and A results in zero adjustment to the initial estima- 
tor, while a complete association causes fi to be uniden- 
tifiable. A simple option is to use all variables less than 
a pre-defined correlation with A as W. In [22], they 
authors suggest 0.7 as a conservative threshold. Instead 
of applying a universal cutoff, we can also set individua- 
lized correlation threshold for each A. Below we provide 
a data adaptive procedure to do it. One first defines a 
sequence of correlation cutoffs <5. For each choice of S, 
one computes the corresponding TMLE p-value for A. 
One then sets a p-value threshold A, and chooses the 
maximum 5 that has produced a p-value less than A. 
The degree of the protection is determined with the 
value of A. In general, the smaller the A is, the more the 
protection. The value of A can be either fixed a priori or 
chosen by cross validation. We refer to it as the TMLE- 
VIM(A) procedure. It allows us to adjust for the con- 
founding in the dataset adaptively and flexibly, and pro- 
tect the algorithm against the harm from high 
correlations among variables. It works best when many 



variables are closely correlated in a complex way. How- 
ever, it does require more time to run, especially when 
A needs to be chosen by cross validation. In many cases, 
a universal cutoff of 0.7 will work fine. In Additional 
File 1 we provide the mathematical formulation of the 
TMLE-VIM(A) procedure. Once we have all the vari- 
ables ranked by their p-values. the candidate list can be 
truncated by either applying a p-value threshold or tak- 
ing the top k ranked variables. Both of them are sound 
practices. In our simulations and data analysis, we 
usually truncate the list at a p-value threshold 0.05. 

Results and Discussion 

Simulation studies 

We performed two sets of simulations. The first set of 
simulations investigates how TMLE-VIM responds to 
changes in the number of confounding variables, the 
correlation level among variables, and the noise levels. 
The second set studies the TMLE-VIM with more com- 
plex correlation structures and model misspecification. 
The performance of the dimension reduction procedure 
was primarily evaluated by the achieved prediction accu- 
racy using a prediction algorithm on the reduced sets of 
variables, illustrated in the following analysis flow: 

Compute VIM Reduced variables -> Prediction Algorithm. 

Two prediction algorithms, LASSO and D/S/A (Dele- 
tion/Substitution/Addition) [25], were used. D/S/A 
searches through the variable space and selects the best 
subset of covariates by minimizing the cross validated 
residual sum of squares. In our simulations, LASSO and 
D/S/A predictions are often similar. We used D/S/A in 
simulation I as it provides convenience to count what 
variables are included in the prediction model. LASSO 
was used in simulation II for its faster speed. We also 
used multivariate linear regression (MVR) as a compari- 
son to machine learning algorithms when applicable. 
Part I 

In simulation I, we varied the number of non-causal 
variables (W), the correlation coefficient p among vari- 
ables, and the noise level af to see how TMLE-VIM 
responds to them. For each simulated observation O, = 
(Y it Ai, Wi), where i indexes the i-th sample, the out- 
come Y) was generated from a main effect model of 25 
As: 

25 

Y i = 2^A J1 + e i , 
;=i 

where / indexes the y'-th A, and e t is a normal error 
with mean 0 and variance cr e 2 . Each Aj was correlated 
with m Ws, and hence the total number of Ws is m w = 
25 m. Aj and its associated W s were jointly sampled 
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from a multivariate Gaussian distribution with mean 0 
and variance-covariance matrix S, where S is a correla- 
tion matrix with an exchangeable correlation coefficient 
p. This simulation scheme resulted in 25 independent 
clusters of covariates. Within each cluster, the covariates 
are correlated at level p. 
Simulations were run for combinations of: 

♦ m = (10, 20) corresponding to m w = (250, 500); 

. a e = (1, 5, 10); 

. and p = (0.1, 0.3, 0.5, 0.7, 0.9). 

For each combination, we simulated a training set of 
500 data points and a testing set of 5000 data points. 
The training set was used to obtain the prediction 
model while the testing set was used to calculate the L2 
risk. We also calculated a cross-examined L2 risk using 
a testing set with a p other than that of the training set. 
This is to demonstrate that by identifying more causally 
related variables, TMLE-VIM is robust to the change of 
the joint distribution among the covariates As and Ws. 
In specific, for each prediction model obtained from a 
training set, we calculated the L2 risk on the testing set 
generated with p = 0.1 regardless of what p was used to 
generate the training set. As a benchmark, we also used 
univariate regression in parallel with TMLE-VIM to 
reduce the dimensionality of the dataset, denoted with 
UR-VIM. Once the variable importance was calculated, 
we cut short the variable list using a p-value threshold 
0.05. Each combination was replicated 10 times and 
results took the average. 

TMLE-VIM used LASSO to obtain both the initial 
estimator q' 0 ' and the gene confounding mechanism 
estimator g n ( W ). In the g n (W ), W was all the variables 
excluding A. TMLE-VIM has demonstrated a consis- 
tent advantage over UR-VIM with respect to the final 
prediction error over a range of simulation settings. 
This is particularly the case when the joint distribution 
of the covariates changes and when predictions were 
made by MVR that lacks internal capacity of model 
selection. Smaller <7 e , larger m w , and larger p tend to 
magnify this advantage. Also, TMLE-VIM risks have 
smaller standard errors than the UR-VIM risks. In 
Table 1, we present our simulation results for five dif- 
ferent p values and two different m w values, with aj 
fixed at 5. The following summary quantities are 
reported: 

. R r = (UR-VIM risk - TMLE-VIM risk)AJR risk: the 
proportion of the risk reduction of TMLE-VIM rela- 
tive to the UR-VIM risk. It measures by how much 
TMLE-VIM outperforms UR-VIM. The bigger the 
number, the more the advantage. 



Table 1 The simulation I results 



p 




m„ 


= 250 


m w 


= 500 






MVR 


DSA 


MVR 


DSA 




ft, 


0.2341 ; 


0.2436 ; 


0.4035 ; 


0.4230 ; 






02251 


02351 


03784 


0 3943 


0.1 


Ra 


1 .0870 




1 .21 36 






Rw 


0.6522 




0.8846 






RRdsa 


na 


1.0130 


na 


1 .0680 




Rr 


0.2202 ; 


0.2231 ; 


0.2341 ; 


0.2027 ; 






02297 


02247 


02307 


0 1975 


0.3 


Ra 


1 .0776 




1 .0684 






Rw 


0.1528 




0.0958 






RRdsa 


na 


1 .0345 


na 


1 .0299 




Rr 


0.2425 ; 


0.1169 ; 


0.4883 ; 


0.1268 ; 






031 15 


0 1285 


0 5959 


01217 


0.5 


Ra 


1 .0373 




1 .0331 






Rw 


0.0355 




0.0149 






RRdsa 


na 


1.0251 


na 


1 .0335 




Rr 


0.3599 ; 


0.1307 ; 


0.8001 ; 


0.1740 ; 






0.5872 


0.2545 


0.9093 


0.2976 


0.7 


Ra 


1.0081 




1 .0000 






Rw 


0.0275 




0.0162 






RRdsa 


na 


1 .0693 


na 


1.1055 




Rr 


0.2262 ; 


-0.1364 ; 


0.9390 ; 


-0.5498 ; 






0.7248 


0.2805 


0.9885 


0.2657 


0.9 


Ra 


0.8415 




0.5502 






Rw 


0.0364 




0.0204 






RRdsa 


na 


1.2630 


na 


1.6103 



Bold fonts: testing set (a). Italic fonts: testing set (b). 

na: not available. -: the same value as the previous entry. 



.R A = TMLE-VIM N A /VR-VIM N A : the ratio of the 
number of ,4s {N A ) in the TMLE-VIM list to the 
number of As in the UR-VIM list. 
>R W = TMLE-VIM AV/UR-VIM N w : the ratio of 
the number of W s (N w ) in the TMLE-VIM list to 
the number of Ws in the UR-VIM list. 
'RRdsa = TMLE-VIM P A /VR-VIM P A : the ratio of 
the proportion of As (P A ) in the final D/S/A predic- 
tion model resulted from the TMLE-VIM procedure 
to that from the UR-VIM. It measures the relative 
chance of arriving at a truly associated variable in 
the final model through the path of TMLE-VIM, 
referenced to the UR-VIM. 

The R r was calculated on two different testing sets. 
One is the testing set generated with the same p as the 
corresponding training set, and we refer it to "testing set 
(a)"; the other is the testing set generated with p = 0.1, 
and we refer it to "testing set (b)". Testing set (a) shares 
the same correlation structure as the training set, while 
in testing set (b) all the variables are essentially indepen- 
dent of each other. Testing set (b) is a simple represen- 
tation of the scenario that when a new experiment is 
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conducted the overall joint distribution of the covariates 
changes while the causal mechanism remains the same. 
In Table 1, the bold R r was calculated on testing set (a), 
and the Italic R r was on testing set (b). We make a few 
points here about Table 1: 

♦ The proportion of the risk reduction (R r ) of the 
TMLE-VIM relative to the UR-VIM is typically 
more than 20% for the MVR prediction and 10% for 
the D/S/A prediction. In some cases, the risk reduc- 
tion of the MVR can be very significant. For exam- 
ple, when m w = 500 and p = 0.7, the TMLE-VIM 
risk is close to only half of the UR-VIM risk. TMLE- 
VIM tends to deliver more advantages when m w = 
500 than when m w = 250. When the correlation 
coefficient p increases, the TMLE-VIM performs 
increasingly better than the UR-VIM for the MVR 
prediction. For the D/S/A prediction, small or large 
ps seem to benefit most from the TMLE-VIM. For 
intermediate p, the benefit is still there but reduced. 
We believe that how much the risk can be reduced 
by the TMLE-VIM is a combination of factors such 
as the number of As and Ws in the reduced candi- 
date list, the correlation structures among covariates 
and the internal optimization procedures of D/S/A. 
The advantage of the TMLE-VIM over the UR-VIM 
does seem to be more significant on the testing set 
(b) than the testing set (a), in support of our hypoth- 
esis that by identifying more causally related vari- 
ables the TMLE-VIM results generalize better to 
new experiments. 

♦ Most R A values are slightly higher than 1 while the 
R w values are much smaller. This indicates that on 
average, in the TMLE-VIM list, the number of cor- 
rectly identified As is slightly higher than that in the 
UR-VIM list, while the number of falsely associated 
W s is much less. It is especially the case when the 
correlations are high among variables. The low 
counts of false positives is a major contributing fac- 
tor that the prediction made on the TMLE-VIM 
candidate list is better than that on the UR-VIM. 

♦ As to the number of As that are finally made into 
the D/S/A prediction model, the TMLE-VIM in 
most cases displays a slight advantage over the UR- 
VIM. A closer look reveals that the variables 
included in the D/S/A model only differs by one or 
two between the TMLE-VIM and the UR-VIM. But 
the prediction risk has a measurable difference. This 
probably implies that every single variable counts in 
making good predictions in these simulations. 

♦ When p = 0.9, the situation seems to be losing its 
track. The TMLE-VIM did worse than the UR-VIM 
in terms of correctly identified variables as well as 
the prediction risk of the testing set (a). Considering 



the high correlations among variables, this could 
possibly be attributed to the overfitting in the g n {W). 
Indeed, in [22], the authors showed that TMLE dete- 
riorates when adjusting for variables with correlation 
coefficients beyond 0.7. However, the RRdsa indi- 
cates that the chance of including a correct variable 
in the final D/S/A model based on the TMLE-VIM 
list is higher than that on the UR-VIM. Further 
looking into the data, we found out that the number 
of As that made into the D/S/A model from the 
TMLE-VIM list is actually greater than that from 
the UR-VIM, while the number of Ws is much less. 
Henceforth, the D/S/A model built on the TMLE- 
VIM list is closer to truth, but somehow its predic- 
tion is worse than the model built on the UR-VIM 
list. This seems to suggest that when provided with 
the UR-VIM list, the D/S/A has offset its model for 
the missed As from highly correlated Ws, while for 
the TMLE-VIM, this can not be done since there 
are not many Ws in the list. It is the same reason 
that the UR-VIM underperforms the TMLE-VIM on 
the testing set (b) when those surrogates of As were 
lost. For the MVR, although the TMLE-VIM shows 
a dominant advantage over the UR-VIM with respect 
to the prediction accuracy, the TMLE-VIM only 
identified 77% (when m w = 250) and 57% (when m w 
= 500) of the As identified by the UR-VIM. The bet- 
ter prediction is merely due to the fact that the 
MVR breaks down when too many variables entered 
the model. This is particularly the case when m w = 
500. 

Figure 1 presents a graphical representation of a typi- 
cal example in simulation I with (a e = 5, m w = 250, p = 
0.7). Besides the advantage displayed by the TMLE-VIM 
relative to the UR-VIM, we also see much smaller differ- 
ences between the TMLE-VIM risks of the testing set 
(a) and (b) compared to the UR-VIM because TMLE- 
VIM was able to detect more As. In summary, when 
confounder are properly adjusted in g n (W), TMLE-VIM 
improves not only the performance of relatively simple 
algorithms such as the MVR, but also the more complex 
learning algorithms with built-in capacities of variable 
selection. Interested readers can find all the original pre- 
diction risks and counts of As and Ws and their stan- 
dard errors in Additional File 2 for this simulation. 
Part II 

Simulation II examines the TMLE-VIM on larger-scale 
datasets with much more complex correlation struc- 
tures. The simulation consists of 500 samples and 1000 
variables. We used a correlation matrix derived from the 
top 800 genes in a real dataset published in [26]. For 
these genes, the median absolute correlation coefficient 
was centered at 0.26, the lst/3rd quartile being 0.16/ 
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P-value threshold (on Iog10 scale) 

Figure 1 A typical example in simulation I. This graph presents the average L2 risk of the final prediction model on the candidate list from 
the UR-VIM and the TMLE-VIM, for simulation I data with setup (a e = 5, m w = 10, p = 0.7). In the left panel, the MVR risk is plotted against a 
series of p-value thresholds used to truncate the candidate list; the right panel plots the D/S/A risk. The testing set (a) predictions are grouped in 
solid blue lines, and the testing set (b) predictions are grouped in broken orange lines. Dots represent the UR-VIM, and triangles represent the 
TMLE-VIM. 



0.37, and the maximum as high as 0.9977. Hence, simu- 
lation II tried to mimic the correlation structure in this 
real data set. The outcome Y was generated from two 
different models using 20 As. One is a linear model, and 
the other is polynomial. 

Details of this simulation is provided in the Additional 
File 1. A test dataset of 5000 points were simulated to 
assess the L2 prediction risk. We repeated the simulation 
for 10 times and results took the average. In TMLE-VIM, 
we tried two different initial estimators. One is the univari- 
ate regression as simple as Y ~ A, and the other is the 
LASSO estimator. LASSO was also used to get the g n (W) 
and to make the final predictions, we adjusted universally 
in the g n ( W ) for the variables that are correlated with A 
with an arbitrary correlation coefficient less than 0.7. All 
the q(°) and g tl (W) models were main-term linear. Hence, 
with the polynomial outcome, we could examine how 
TMLE-VIM performs when mis-specified models were 
provided. To summarize the result, we computed a R 



TMLE - VIM(Q, ( 1 0) 
TMLE - VIM(Qi 0) 



quantity, representing the proportion of explained variance 
relative to an intercept model. It is defined as 1 - mean 

1 - 2 

risk/MST, where MST = - J\ - Y) and y is the 

n 

mean of Y. Table 2 lists the R 2 and the number of true 
positives (T.P.) and false positives (F.P.) in the reduced list 
of candidate variables, for the UR-VIM, the 
UR> and the 

LASSO)- Compared to UR-VIM, 
TMLE-VIM improved the prediction risk by providing 
LASSO a candidate list with more truly and less falsely 
associated variables for both the linear and polynomial 
simulations. It is worth noting that even with an initial 
estimator as simple as the univariate regression 
(Qn°^ = UR)> TMLE-VIM still achieves a significant 
increase in R 2 by modeling the g n ( W). 

The numbers in Table 2 were based on candidate lists 
that were cut short with a p-value threshold of 0.05. In 
Table 3, we provide the results based on the top 100 
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Table 2 The simulation II results (p-value) 



Linear 



R 2 T.P 

UR-VIM 0.2887 13.8 

TMLE - VIM(Q° = UR) 0.4849 16.6 

TMLE - VIM(Q° = LASSO) 0.6289 19.7 

TMLE-VIMft) 0.6479 20 



The candidate variable list contains all variables with p-values less than 0.05. 

ranked genes. The numbers of UR-VIM and the 
TMLE - VIM(Q^ = UR) are less satisfying than those in 
Table 2, while the TMLE - VIM(Qj, 0) = LASSO) achieved 
comparable results. This suggests that the 
TMLE - VIM(oJ,°' = LASSO) p-values of As are among 
the smallest ones, and shortening the length of the list 
does not affect the final result. Regardless of the wea- 
kened results, The TMLE - VIM^^ 0 ' = UR) st iU displays 
a non-ignorable advantage over the UR-VIM with 
respect to the prediction accuracy, while the number of 
correctly identified As is slightly smaller than that of the 
UR-VIM. We then looked at the correlation matrix 
among the top 100 selected genes, and it occurs that 
the correlation among them is the least for the 
TMLE - VIM(Q° = LASSO), the most for the UR-VIM, 
and the TMLE - VIM(q1 0) = UR) lies in between. This 
could explain why the TMLE - VIM(Q° = UR) does a 
better job in prediction regardless of less As. 

We also carried out the TMLE-VIM(A) procedure 
with LASSO as the initial estimator, allowing the 
data select the correlation cutoff for variables to be 
adjusted in the g n {W ). Results are also reported in 
Table 2 and Table 3. TMLE-VIM(A) identified more 
^4s but also more Ws, and the prediction accuracy is 
only slightly improved. On the other hand, the corre- 
lations among the selected top 100 variables are 
quite small. It seems by data adaptively adjusting for 
the correlation levels in the g n (W), TMLE-VIM(A) 
returns a more independent set of genes. The actual 
risks and standard errors are contained in Additional 
File 2. 



Simulation 



Polynomial 


F.P. 


R 2 


T.P. 


F.P. 


605.3 


0.1851 


13.4 


555.9 


280.5 


0.3245 


14.7 


255.5 


29.1 


0.4203 


17.9 


24 


41.6 


0.4498 


19.2 


105.9 



Data Analysis 

Breast cancer patients are often put on chemotherapy 
after the surgical removal of the tumor. However not all 
patients will respond to chemotherapy, and proper gui- 
dance for selecting the optimal regimen is needed. Gene 
expression data have the potential for such predictions, 
as studied in [26]. The dataset from [26] contains the 
gene expression profiling on 22283 genes for 133 breast 
cancer patients. The outcome is the pathological com- 
plete response (pCR). This is a binary response asso- 
ciated with long-term cancer free survival. There are 
also 13 clinical variables collected in the dataset includ- 
ing the ER (estrogen receptor) status, which is a very 
significant clinical indicator for chemotherapy response. 

The goal of the study is to select a set of genes that 
best predict the clinical response pCR. The first step is 
to reduce the number of genes worth of consideration, 
and we applied both UR-VIM and TMLE-VIM (with Q 
<0) = UR and Q (0) = LASSO) for this purpose. For the 
TMLE-VIM(Q (0) = LASSO), the was estimated by 
LASSO using the top 5000 ranked genes. We then took 
all the genes with the FDR-adjusted p-values less than 
0.005 [27], as suggested in the original paper, and upon 
them we built a predictor using the Random Forest 
(tuning parameters mtry = number of variables/3, ntree 
= 3000 and nodesize = 1). The clinical covariates were 
treated in the same way as genes. To prevent the algo- 
rithm from breaking down, we only adjusted for the 
confounder with correlation coefficients less than 0.7 
with A in the g„{W ). We carried out a 10-fold honest 
cross validation. We divided the dataset into 10 subsets. 
Each subset was regarded as a validation set and the 



Table 3 The simulation II results (top 100) 



UR-VIM 

TMLE - VIM(Q° = UR) 
TMLE - VIM(Q° = LASSO) 

TMLE-VIMft) 



R 

0.1444 
0.1907 
0.6059 
0.5916 



Linear 



T.P 

9.0 



20 



Simulation 



Polynomial 



cor. 

0.2956 
0.2534 
0.2289 
0.1242 



R 

0.0862 
0.1605 
0.4132 
0.3859 



T.P. 

8.2 
7.2 
19.2 
17.7 



cor. 

0.3642 
0.2590 
0.2234 
0.0867 



The candidate variable list contains the top 100 variables ranked by their p-values. 
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Table 4 The analysis result of the breast cancer dataset 





Num. of genes in the candidate list 


C.V. classification accuracy 


Corr. level among the top 100 genes 


UR-VIM 

TMLE - VIM(Q° = UR) 
TMLE - VIM(Q° = LASSO) 


327 
660 
818 


07669 
07744 
07744 


043 
0.18 
0.21 



rest as the training set. We reperformed the entire ana- 
lysis, i.e. VIM calculation — » dimension reduction — > 
Random Forest classifier, on all 10 training sets and pre- 
dicted the outcome of the validation set using the classi- 
fier built on the training samples. We can then use 



these cross validated predictions to assess the true clas- 
sification accuracy of our algorithm. 

Analysis results are tabulated in Table 4. The UR-VIM 
produced a candidate list of 326 genes and one clinical 
variable the "ER status", while the list of the TMLE- VIM 
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(Q (0) = UR) consists of 660 genes and TMLE-VIM(Q (0) = 
LASSO) 818 genes. The TMLE-VIM identified many 
more genes than the UR-VIM. Among all the identified 
genes, 429 overlap between the TMLE - VIM(q[, 0) = LASSO) 
and TMLE - VIM(Qj, 0) = UR> 15 overlap between the UR- 
VIM and TMLE-VIM(Q <0) = UR), 10 overlap between the 
UR-VIM and TMLE-VIM(Q (0) = LASSO), and only 4 
genes are shared among all three (please see Figure 2). 
The TMLE-VIM appeared to have selected almost a differ- 
ent set of genes than the UR-VIM. 

The TMLE-VIM(Q (0) = UR) and the TMLE-VIM(Q (0) 
= LASSO) results are quite similar to each other regard- 
less of the adequate difference between the initial esti- 
mators. It seems the modeling of the g n (W ) had played 
a significant role and steered away the initial univariate 
estimates. Further investigation found out that genes in 
the UR-VIM list are highly correlated with the clinical 
indicator ER status, while the TMLE-VIM genes are not. 
Consequently, the TMLE-VIM genes are less correlated 
to each other than the UR-VIM genes. Looking at the 
first 100 ranked genes, the absolute median of the corre- 
lation coefficients for the UR-VIM is 0.43, while for the 
TMLE-VIM, it is about half of that number. Although 
the input variables to the Random Forest are different, 
The cross validated (CV) classification accuracy are 
quite similar among these three methods. We also 
passed all 22,000 genes to Random Forest and looked at 
its variable importance measurement. The Random For- 
est VIM (RF-VIM) is more similar to the UR-VIM: 
about 50% of them overlap but only a few overlap with 
the TMLE-VIM. The RF-VIM genes are also highly cor- 
related with the ER status, albeit the less severity than 
the UR-VIM. Its OOB classification accuracy (0.7669) is 
comparable with all three other methods. 

In summary, the UR-VIM and RF-VIM seemed to 
have identified genes that are strong predictors of the 
clinical variable ER status. The ER status is a strong 
indicator of the outcome pCR. Hence, the final predic- 
tion accuracy still seems quite good. The TMLE-VIM 
has identified a list of genes of which a small proportion 
is strong predictors of ER status and others are not 
associated with the ER status. Its prediction accuracy is 
slightly better than that of the UR-VIM and RF-VIM. 

Conclusions 

We have shown in this paper with extensive simulations 
that the TMLE based variable importance measurement 
can be incorporated into a dimension reduction proce- 
dure to improve the quality of the list of the candidate 
variables. It requires an initial estimator q(°) and a gene 
confounding mechanism estimate g n {W). A consistent 
q(°) ensures the consistency and the efficiency of the 
TMLE estimate. When q' 0 ' is not consistent, a correct 



specification oi g n (W) can still produce consistent esti- 
mates while that estimate will not be efficient any more. 
We generally recommend to do as good a job as we can 
on obtaining the as a better means both a 

smaller bias and a smaller variance. Nevertheless, algo- 
rithms as simple as univariate regression are also valid 
choices, and in this case, we will rely solely on the good- 
ness of g„(W ). The computation of g„(W ) directly 
affects the speed of the TMLE-VIM, as it has to be 
redone for every variable. Hence, one may want to 
choose an approach that is reasonably fast. In our study, 
we chose the GLMNET R Package as our primary tool 
to get g n { W ), and it worked very well. In practice, one 
needs to balance the resources used for the initial esti- 
mator and the gene confounding mechanism. With a 
proper design of the two estimating stages, TMLE-VIM 
is a fairly fast procedure. It is also worth mentioning 
that the TMLE-VIM can sometimes be sensitive to the 
overfitting in the and hence, caution needs to be 
exercised when choosing an aggressive algorithm. 

A popular dimension reduction approach is the princi- 
ple component analysis (PCA). The PCA computation 
does not involve the outcome, and so it could be less 
powerful when prediction is the primary goal. Its output 
is a linear combination of all the genes. Though not a 
gene selection approach, we still carried it out on our 
simulation I data as an interesting comparison to our 
approach. PCA demonstrates an intermediate perfor- 
mance with respect to the UR-VIM and the TMLE-VIM 
on small p-value cutoffs. This means a few top compo- 
nents carry all the prediction power. When the p-value 
cutoff is increased, and more components enter the can- 
didate list, its results became quite unsatisfying. When 
the correlation structure changes among the genes, PCA 
has done a poor predicting job. The PCA results are 
contained in Additional File 3. 

Usually, the reduced set of variables will serve as the 
input of a prediction algorithm to build a model. Such 
algorithms used in this article include MVR, LASSO, 
and D/S/A. We have noticed that in most of our simula- 
tions, the MVR prediction often achieves a similar risk 
as LASSO and D/S/A on the TMLE-VIM reduced set of 
variables. It suggests that further variable selection may 
not be necessary for the TMLE-VIM candidate list, and 
we can use simpler algorithms to get a good prediction. 
In fact, the TMLE-VIM can go beyond the scope of 
dimension reduction. It can be iteratively applied to the 
data until it converges to a list of several variables that 
are most likely to be causal to the outcome. In this case, 
one may want to use the Super Learner [28] as the pre- 
diction algorithm, which works more effectively with the 
TMLE-VIM. The Super Learner is an ensemble learner 
that combines predictions from multiple candidate 
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learners with optimal weights. It has been shown in [29] 
that the Super Learner performs asymptotically equal to 
or better than any of its candidate learners. The Super 
Learner allows the data to objectively blend results from 
different algorithms rather than relying on a single algo- 
rithm chosen subjectively by an analyst. Hence it enjoys 
a greater flexibility to explore the model space and 
usually produces reasonable predictions consistently 
across a wide variety of datasets, and serves as a very 
good prediction algorithm for the TMLE-VIM. On the 
other hand, it is also more computationally demanding. 

TMLE-VIM is a quite general approach. Besides gene 
expression data, TMLE-VIM can also be applied to 
genetic mapping problems. The genome-wide association 
studies (GWAS) can involve more than a million of 
genetic markers. In this case, only the univariate analysis 
seems to be feasible of ranking every marker. With the 
TMLE-VIM procedure, we can run more complex algo- 
rithms on a subset of top ranked markers, taking it as the 
initial estimator, and then evaluate every single marker. 
The variable importance of each marker is thus obtained 
through a multi-marker approach and being adjusted for 
its confounder. However, the GWAS in human beings is 
usually case-control data, and the current TMLE-VIM 
needs to be extended to accommodate such outcomes. 

Additional material 



Additional file 1: More detailed descriptions of the TMLE 
methodology and the conducted simulations 

Additional file 2: The additional materials of the conducted 
simulations 

Additional file 3: The PCA results 
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